Programming
Running some stuff through ssh with Python using the fabric library (see here).

Measuring with profiling tools is critical. R already provides the basic tools for performance analysis:
* the ''system.time'' function for simple measurements
* the Rprof function for profiling R code
* the Rprofmem function for profiling R memory usage

In addition, the profr and proftools packages on CRAN can be used to visualize Rprof data. ''rbenchmark'' is useful for comparisons. Compiled code can be profiled with general-purpose tools or with Google’s ''perftools''.

Basic rules for speeding up:
* vectorization instead of loops
* just-in-time compilation (jit package) of loops and arithmetic expressions
* use of BLAS (’Basic Linear Algebra Subprograms’): R can be built against a so-called optimised BLAS such as Atlas (open source), Goto (not ’free’), the Intel MKL or the AMD ACML. The speed gains can be noticeable. On Debian/Ubuntu, one can simply install one of the atlas-base-* packages.
* use of GPUs (’graphics processing units’): GPUs are essentially hardware that is optimised for I/O and floating-point operations, so floating-point-heavy code can run much faster than on standard CPUs. The key development environments are:
  * Nvidia CUDA (Compute Unified Device Architecture), introduced in 2007, which provides C-like programming
  * OpenCL (Open Computing Language), introduced in 2009, which provides a vendor-independent interface to GPU hardware
* use of compiled code:
  * inline for automated wrapping of simple expressions
  * Rcpp for easing the interface between R and C++
* use of implicit parallel programming: multicore with mclapply and parallel/collect; pnmath uses OpenMP compiler directives; pnmath0 uses pthreads to implement the same interface.
* use of explicit parallel programming: several CRAN (or R-Forge) packages provide the ability to execute R code in parallel:
  * NWS ("NetWorkSpaces") is a simple alternative to MPI (see below).
It is based on Python and is cross-platform, and it originates with one of the predecessor companies to Revolution Analytics. NWS is accessible from R, Python, Matlab, Ruby, and other languages.
  * Rmpi is a CRAN package that provides an interface between R and the Message Passing Interface (MPI), a standard for parallel computing. It allows us to use MPI directly from R and comes with several examples. It can also be used as a building block for higher-level usage via snow or doMPI/foreach.
  * snow (using MPI, PVM, NWS or sockets), also snowFT: it can be used to initialize and use a compute cluster using one of the available methods: direct socket connections, MPI, PVM, or NWS.
  * snowfall
  * multicore
  * parallel
  * foreach with doMC, doSNOW, doMPI, doRedis
  * plus others (rpvm, papply, taskPR, ...)
  * Hadoop and packages like RHIPE
  * different parallel computing approaches like Rdsm, which uses distributed shared memory

Two CRAN packages ease the analysis of large datasets:
* ''ff'', which maps R objects to files and is therefore only bound by the available filesystem space
* ''bigmemory'', which maps R objects to dynamic memory objects not managed by R

Showing your application / create tutorials
Wink is a tutorial and presentation creation software, primarily aimed at creating tutorials on how to use software (like a tutor for MS-Word/Excel etc). Using Wink you can capture screenshots, add explanation boxes, buttons, titles etc. and generate a highly effective tutorial for your users.

Documenting your codes
An interesting way to document your programs is to use reStructuredText (ReST or RST). A list of ReST support tools is available here. A tutorial on how to write documents in ReST is available here.

Hadoop
Installing on DEBIAN (tested on sid but should work on wheezy too)
Go to https://beagle.whoi.edu/redmine/projects/ibt/wiki/Installing_Hadoop_on_Debian for more details or if you run into trouble trying the following.
Add the following two lines to /etc/apt/sources.list:
  deb http://archive.cloudera.com/debian squeeze-cdh3 contrib
  deb-src http://archive.cloudera.com/debian squeeze-cdh3 contrib

Install Cloudera's repo key and update the package lists:
  curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
  sudo apt-get update

Create the hadoop user:
  adduser --ingroup hadoop hduser

For convenience, add this line to ~hduser/.bashrc:
  export PATH=$PATH:/usr/lib/hadoop/bin

Establish password-less login for hduser. Note that openssh-server needs to be installed and running on those nodes.
  su - hduser
  ssh-keygen -t rsa -P ""
  cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Now run an example job. Grab some plain-text files full of English text and put them in some scratch directory (say ~/test).
  hadoop dfs -copyFromLocal ~/test ~/hduser
  hadoop dfs -ls ~/hduser

Test using the provided "word count" example:
  hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount ~/hduser ~/out

View the output:
  hadoop fs -cat ~/out/part-r-00000

RHadoop
Installing RHadoop on DEBIAN sid
Based on http://girishkathalagiri.blogspot.be/2012/09/installing-r-in-hadoop-rhadoop.html

If you use Debian sid, you can install R 2.15.2 just by typing:
  apt-get install r-base

Install git:
  apt-get install git-core

Let's install RHadoop, in particular rmr2:
  git clone git://github.com/RevolutionAnalytics/RHadoop.git
  R CMD build RHadoop/rmr2/pkg/

Check that you have all the dependencies needed to get it working:
  R CMD check rmr2_2.0.1.tar.gz

In my case I had to install some stuff, first from the shell and then from within R:
  apt-get install libcurl4-openssl-dev libghc-quickcheck-dev
  install.packages(c('RCurl','Rcpp','RJSONIO','itertools','digest','functional','httr'))

Then install quickcheck (a package from RHadoop too):
  R CMD build RHadoop/quickcheck

A long list of tests should all end with "OK" tags. If they do not, look at the messages that failed.
In my case I still had this message:
  File ‘/root/rmr2.Rcheck/rmr2/libs/rmr2.so’:
    Found ‘_ZSt4cerr’, possibly from ‘std::cerr’ (C++)
      Object: ‘typed-bytes.o’
  Compiled code should not call functions which might terminate R nor
  write to stdout/stderr instead of to the console.
  See ‘Writing portable packages’ in the ‘Writing R Extensions’ manual.
  * checking examples ... ERROR
  Running examples in ‘rmr2-Ex.R’ failed
  The error most likely occurred in:
  > ### Name: big.data.object
  > ### Title: The big data object.
  > ### Aliases: big.data.object
  >
  > ### ** Examples
  >
  > some.big.data = to.dfs(1:10)
  Error in hadoop.streaming() :
    Please make sure that the env. variable HADOOP_STREAMING or HADOOP_HOME are set
  Calls: to.dfs -> system -> paste -> hadoop.streaming
  Execution halted

This indicates that the program cannot find the Hadoop streaming jar. Besides, we need to set the env. variable HADOOP_STREAMING or HADOOP_HOME. First link the jar:
  ln -s /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u5.jar /usr/lib/hadoop/hadoop-streaming.jar

And add the env. variables to ~/.bashrc:
  export HADOOP_HOME=/usr/lib/hadoop
  export HADOOP_CONF=/etc/hadoop/conf
  export HADOOP_STREAMING=/usr/lib/hadoop/hadoop-streaming.jar

Check again whether rmr2 can be installed on your system. Don't forget to refresh your ~/.bashrc first:
  source ~/.bashrc
  R CMD check rmr2_2.0.1.tar.gz

Once all checks are OK, you can install rmr2:
  R CMD INSTALL rmr2_2.0.1.tar.gz
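
If the check keeps complaining about HADOOP_STREAMING or HADOOP_HOME, it can help to confirm from the shell that the variables point at something real before running R again. The function below is only a minimal sketch and is not part of Hadoop or rmr2: the name check_hadoop_env is hypothetical, and the rule it applies (HADOOP_STREAMING must be a readable file, otherwise HADOOP_HOME must be an existing directory) is an assumption about what is worth verifying, not a copy of rmr2's own lookup.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Hadoop or rmr2).
# Succeeds if HADOOP_STREAMING names a readable file; if that variable is
# unset or empty, succeeds if HADOOP_HOME names an existing directory.
check_hadoop_env() {
    if [ -n "${HADOOP_STREAMING:-}" ]; then
        if [ -r "$HADOOP_STREAMING" ]; then
            echo "HADOOP_STREAMING ok: $HADOOP_STREAMING"
            return 0
        fi
        # Set but unusable: report it rather than silently falling back.
        echo "HADOOP_STREAMING is set but not readable: $HADOOP_STREAMING" >&2
        return 1
    fi
    if [ -n "${HADOOP_HOME:-}" ] && [ -d "$HADOOP_HOME" ]; then
        echo "HADOOP_HOME ok: $HADOOP_HOME"
        return 0
    fi
    echo "neither HADOOP_STREAMING nor HADOOP_HOME is usable" >&2
    return 1
}
```

For example, running check_hadoop_env && R CMD check rmr2_2.0.1.tar.gz only starts the check when the environment looks sane. A dangling symlink fails the readability test, so a mistyped target in the ln -s step is caught as well.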